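We examine whether SGD-based optimization of deep neural networks (DNNs) can be adapted to produce models that are both highly accurate and easily compressible. We propose a new compression-aware minimizer, called CRAM, which modifies the SGD training iteration in a principled way in order to produce models whose local loss behavior is stable under compression operations such as weight pruning or quantization. Experimental results on standard image classification tasks show that CRAM produces dense models that can be more accurate than standard SGD-type baselines, yet are surprisingly stable under weight pruning: for instance, for ResNet50 on ImageNet, CRAM-trained models can lose up to 70% of their weights in one shot with only minor accuracy loss.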
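To make the "one shot" pruning setting concrete, here is a minimal sketch of removing 70% of the weights in a single step. It assumes plain magnitude-based pruning as the selection criterion, which the abstract does not specify, and it is not the CRAM training procedure itself. The function name `magnitude_prune` and the toy weight vector are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude.

    Assumption: magnitude is used as the pruning criterion; the abstract
    only says that weights are removed in one shot.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(round(len(weights) * sparsity))
    # Indices ordered by increasing magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weight vector (hypothetical values), pruned to 70% sparsity in one shot.
weights = [0.9, -0.05, 0.3, -1.2, 0.01, 0.6, -0.02, 0.15, -0.4, 0.07]
pruned = magnitude_prune(weights, 0.7)
print(pruned)
```

A model that is stable under this operation would keep a loss close to that of the dense weights after the call, without any further fine-tuning.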
translated by 谷歌翻译
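Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on specialized "downstream" datasets. It is generally understood that more accurate models on the "upstream" dataset will provide better transfer accuracy "downstream". In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset that have been pruned, that is, compressed by sparsifying their connections. Specifically, we consider transfer using unstructured pruned models obtained by applying several state-of-the-art pruning methods, including magnitude-based, second-order, re-growth, and regularization approaches, in the context of twelve standard transfer tasks. In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups. At the same time, we observe and analyze significant differences in the behavior of different pruning methods.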
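The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate. Recent work has investigated the even harder case of sparse training, where the DNN weights are kept as sparse as possible in order to reduce computational costs during training. Existing sparse training methods are often empirical and can have lower accuracy relative to the dense baseline. In this paper, we present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models. An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process. This is useful in practice, where compressed variants may be desirable for deployment in resource-constrained settings without re-doing the entire training flow, and it also provides insights into the accuracy gap between dense and compressed models. The code is available at: https://github.com/ist-daslab/acdc.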
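The alternation between dense and sparse phases can be illustrated on a toy problem. The sketch below is not the authors' implementation: the quadratic loss, the phase schedule (even phases dense, odd phases sparse), the top-k magnitude mask, and all names (`topk_mask`, `train_alternating`) are assumptions chosen only to show how one training run can yield a sparse-dense model pair.

```python
def topk_mask(weights, keep):
    """Return a 0/1 mask that keeps the `keep` largest-magnitude weights."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(weights))]


def train_alternating(target, keep, phases=4, steps_per_phase=50, lr=0.1):
    """Alternate dense and sparse phases on the toy loss sum((w - target)^2).

    Assumptions: even-numbered phases are dense, odd-numbered phases are
    sparse, and the sparse mask is recomputed by magnitude at the start of
    each sparse phase.
    """
    w = [0.0] * len(target)
    dense_snapshot = list(w)
    for phase in range(phases):
        if phase % 2 == 1:
            dense_snapshot = list(w)  # dense model at the end of the dense phase
            mask = topk_mask(w, keep)
        else:
            mask = [1.0] * len(w)  # dense phase: every weight is trainable
        w = [wi * mi for wi, mi in zip(w, mask)]
        for _ in range(steps_per_phase):
            grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
            # Masked gradient step: pruned weights stay at exactly zero.
            w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return dense_snapshot, w


target = [1.0, -0.1, 0.5, 0.02, -2.0, 0.3]  # hypothetical optimum
dense_model, sparse_model = train_alternating(target, keep=3)
print(dense_model)
print(sparse_model)
```

With an even number of phases the run ends on a sparse phase, so the function returns the last dense snapshot together with the final sparse weights.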
Task-oriented dialogue (TOD) systems have been applied in a range of domains to help human users achieve specific goals. Systems are typically constructed for a single domain or language and do not generalise well beyond this. Their extension to other languages in particular is restricted by the lack of available training data for many of the world's languages. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++ extends the English-only NLU++ dataset to include manual translations into a range of high, medium and low resource languages (Spanish, Marathi, Turkish and Amharic), in two domains (banking and hotels). MULTI3NLU++ inherits the multi-intent property of NLU++, where an utterance may be labelled with multiple intents, providing a more realistic representation of a user's goals and aligning with the more complex tasks that commercial systems aim to model. We use MULTI3NLU++ to benchmark state-of-the-art multilingual language models as well as Machine Translation and Question Answering systems for the NLU task of intent detection for TOD systems in the multilingual setting. The results demonstrate the challenging nature of the dataset, particularly in the low-resource language setting.
Automatic machine translation (MT) metrics are widely used to distinguish the translation quality of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear whether automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when it is placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation and the success/failure on the final task for the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable, mostly because of undefined ranges. Our analysis suggests that future MT metrics be designed to produce error labels rather than scores to facilitate extrinsic evaluation.
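One way to picture the segment-level comparison is to threshold each metric score into a good/bad prediction and correlate it with the binary downstream outcome. The sketch below uses the Matthews correlation coefficient for that purpose; the choice of coefficient, the 0.5 threshold, and the toy scores and outcomes are all assumptions made for illustration and are not taken from the paper.

```python
import math


def matthews_corrcoef(pred, truth):
    """Matthews correlation coefficient between two equal-length 0/1 sequences."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical per-segment metric scores and downstream task outcomes.
metric_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.6]
task_success = [1, 0, 0, 1, 1, 1]

threshold = 0.5  # assumed cut-off for calling a translation "good"
metric_says_good = [1 if s >= threshold else 0 for s in metric_scores]
mcc = matthews_corrcoef(metric_says_good, task_success)
print(mcc)
```

A value near zero on real data would correspond to the negligible correlation the abstract reports; the need to pick a threshold at all also illustrates why scores with undefined ranges are hard to interpret.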
Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences is those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution tasks that require reasoning over multiple facts. Our dataset is organized into subtasks that differ in terms of which knowledge sources contain relevant facts. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources.
Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications. Recently, LIDAR-supervised methods have achieved remarkable per-pixel depth accuracy in outdoor scenes. However, significant errors are typically found in the proximity of depth discontinuities, i.e., depth edges, which often hinder the performance of depth-dependent applications that are sensitive to such inaccuracies, e.g., novel view synthesis and augmented reality. Since direct supervision for the location of depth edges is typically unavailable in sparse LIDAR-based scenes, encouraging the MDE model to produce correct depth edges is not straightforward. In this work we propose to learn to detect the location of depth edges from densely-supervised synthetic data, and use it to generate supervision for the depth edges in the MDE training. Despite the 'domain gap' between synthetic and real data, we show that depth edges that are estimated directly are significantly more accurate than the ones that emerge indirectly from the MDE training. To quantitatively evaluate our approach, and due to the lack of depth edges ground truth in LIDAR-based scenes, we manually annotated subsets of the KITTI and the DDAD datasets with depth edges ground truth. We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets.
Detecting personal health mentions on social media is essential to complement existing health surveillance systems. However, annotating data for detecting health mentions at a large scale is a challenging task. This research employs a multitask learning framework to leverage available annotated data from a related task to improve the performance on the main task to detect personal health experiences mentioned in social media texts. Specifically, we focus on incorporating emotional information into our target task by using emotion detection as an auxiliary task. Our approach significantly improves a wide range of personal health mention detection tasks compared to a strong state-of-the-art baseline.
We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in fields of computer vision and language, and build from it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
The health mention classification (HMC) task is the process of identifying and classifying mentions of health-related concepts in text. This can be useful for identifying and tracking the spread of diseases through social media posts. However, this is a non-trivial task. Here we build on recent studies suggesting that using emotional information may improve upon this task. Our study results in a framework for health mention classification that incorporates affective features. We present two methods, an intermediate task fine-tuning approach (implicit) and a multi-feature fusion approach (explicit) to incorporate emotions into our target task of HMC. We evaluated our approach on 5 HMC-related datasets from different social media platforms including three from Twitter, one from Reddit and another from a combination of social media sources. Extensive experiments demonstrate that our approach results in statistically significant performance gains on HMC tasks. By using the multi-feature fusion approach, we achieve at least a 3% improvement in F1 score over BERT baselines across all datasets. We also show that considering only negative emotions does not significantly affect performance on the HMC task. Additionally, our results indicate that HMC models infused with emotional knowledge are an effective alternative, especially when other HMC datasets are unavailable for domain-specific fine-tuning. The source code for our models is freely available at https://github.com/tahirlanre/Emotion_PHM.